# Interpretability (machine learning)

> The ability to explain or to present an ML model's reasoning in understandable terms to a human.

The ability to explain or to present an ML [model's](https://wiki.g15e.com/pages/Model%20(machine%20learning.txt)) reasoning in understandable terms to a human. Most [linear regression](https://wiki.g15e.com/pages/Linear%20regression.txt) models, for example, are highly interpretable: you merely need to look at the trained [weights](https://wiki.g15e.com/pages/Weight%20(machine%20learning.txt)) for each [feature](https://wiki.g15e.com/pages/Feature%20(machine%20learning.txt)). Decision forests are also highly interpretable. Some models, however, require sophisticated techniques to become interpretable.

## Articles

- <2025-11-14> - [Understanding neural networks through sparse circuits | OpenAI](https://openai.com/index/understanding-neural-networks-through-sparse-circuits/)
- <2025-08-02> - [Persona vectors: Monitoring and controlling character traits in language models \ Anthropic](https://www.anthropic.com/research/persona-vectors)
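The linear-regression case above can be made concrete with a minimal sketch: fit ordinary least squares on a tiny, made-up dataset (the feature names and numbers are illustrative assumptions, not from any real source) and read each trained weight directly as that feature's contribution to the prediction.

```python
import numpy as np

# Hypothetical data: predict a price from two features, size and age.
# (Purely illustrative values chosen for this sketch.)
X = np.array([[50.0, 30.0],
              [70.0, 10.0],
              [90.0,  5.0],
              [60.0, 20.0]])
y = np.array([150.0, 230.0, 300.0, 190.0])

# Prepend a column of ones so the first weight acts as the intercept,
# then fit ordinary least squares.
X1 = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Interpretability: each weight states how much one unit of that feature
# moves the prediction, holding the other feature fixed.
for name, coef in zip(["intercept", "size", "age"], w):
    print(f"{name}: {coef:+.3f}")
```

Here the model's "reasoning" is fully visible in `w`: a positive weight on size and a negative weight on age say exactly how each feature influences the output, which is what makes linear models highly interpretable.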